Towards an Optimal Bit-Reversal Permutation Program
Authors
Abstract
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation, trivial operations on a RAM, present non-trivial problems when designing highly tuned scientific library functions, particularly for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such “messy” features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algorithms.
Copyright 1998 IEEE. Published in the Proceedings of FOCS’98, 8-11 November 1998 in Palo Alto, CA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966.
1. Background and related work
Given binary strings a and b, let ab denote their concatenation and r(a) denote the reversal of a. Thus, for instance, r(01101) = 10110, and r(ab) = r(b)r(a). Arrays will be indexed by binary strings. The pseudocode statement “for i = 0 to N-1” means that i iterates through all binary strings of length lg(N), where lg represents log base two.
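To make the notation concrete, here is a minimal Python sketch of the reversal r and of a straightforward bit-reversal permutation. It is an illustration only, not one of the paper's programs and not the tuned algorithm the abstract describes. The names `reverse_bits` and `bit_reversal_permutation` are hypothetical, and binary strings are assumed to be represented as integers together with an explicit bit width.

```python
def reverse_bits(i: int, width: int) -> int:
    """Reverse the width-bit binary string of i, i.e. compute r(i)."""
    result = 0
    for _ in range(width):
        # Shift the lowest remaining bit of i onto the low end of result.
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def bit_reversal_permutation(a: list) -> list:
    """Return b with b[r(i)] = a[i]; len(a) must be a power of two."""
    n = len(a)
    width = n.bit_length() - 1  # lg(N) when N is a power of two
    if n == 0 or n != 1 << width:
        raise ValueError("length must be a power of two")
    b = [None] * n
    for i in range(n):  # i ranges over all binary strings of length lg(N)
        b[reverse_bits(i, width)] = a[i]
    return b


# The example from the text: r(01101) = 10110.
assert reverse_bits(0b01101, 5) == 0b10110

# The identity r(ab) = r(b)r(a), with a = 011 and b = 01.
r_b, r_a = reverse_bits(0b01, 2), reverse_bits(0b011, 3)
assert reverse_bits(0b01101, 5) == (r_b << 3) | r_a
```

This direct loop touches `b` in a scattered order, which is the kind of memory-access pattern the abstract identifies as costly on a real memory hierarchy.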
Consider the following three programs, where N is a power of 2, N = N1 N2, and A and B are arrays of length N:
Similar resources
Brief Announcement: Optimal Bit-Reversal Using Vector Permutations
We have developed a bit-reversal algorithm (BRAVO) using vector permute operations, which is optimal in the number of permutations, and its cache-optimal version (COBRAVO). Our implementation on PowerMac G5 shows 2–4.5 fold improvement for small data sets and 15–75% improvement for large data sets (depending on the data element size) over the best known approach (COBRA).
Perfect trees and bit-reversal permutations
A famous algorithm is the Fast Fourier Transform, or FFT. An efficient iterative version of the FFT algorithm performs as a first step a bit-reversal permutation of the input list. The bit-reversal permutation swaps elements whose indices have binary representations that are the reverse of each other. Using an amortized approach this operation can be made to run in linear time on a random-access ma...
On the limits of cache-oblivious rational permutations
Permuting a vector is a fundamental primitive which arises in many applications. In particular, rational permutations, which are defined by permutations of the bits of the binary representations of the vector indices, are widely used. Matrix transposition and bit-reversal are notable examples of rational permutations. In this paper we contribute a number of results regarding the execution of th...
Optimal Matrix Transposition and Bit Reversal on Hypercubes: All-to-All Personalized Communication
In a hypercube multiprocessor with distributed memory, messages have a street address and an apartment number, i.e., a hypercube node address and a local memory address. Here we describe an optimal algorithm for performing the communication described by exchanging the bits of the node address with that of the local address. These exchanges occur typically in both matrix transposition and bit re...
Performance of Parallel Bit-Reversal with Cilk and UPC for Fast Fourier Transform
Bit-reversal is widely known to be an important program and an essential part of the Fast Fourier Transform. If not carefully designed, it can easily take a large portion of an FFT application’s total execution time. In this paper, we present a parallel implementation of bit-reversal for FFT using Cilk and UPC. Based on our previous work of creating parallel bit-reversal using OpenMP in SPMD style...